<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Overview

🤗 Optimum provides an API called BetterTransformer, a fast path of standard PyTorch Transformer APIs to benefit from interesting speedups on CPU & GPU through sparsity and fused kernels as Flash Attention. For now, BetterTransformer supports the fastpath from the native [`nn.TransformerEncoderLayer`](https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference/) as well as Flash Attention and Memory-Efficient Attention from [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html).

## Quickstart

Since its 1.13 version, [PyTorch released](https://pytorch.org/blog/PyTorch-1.13-release/) the stable version of a fast path for its standard Transformer APIs that provides out of the box performance improvements for transformer-based models. You can benefit from interesting speedup on most consumer-type devices, including CPUs, older and newer versions of NIVIDIA GPUs.
You can now use this feature in 🤗 Optimum together with Transformers and use it for major models in the Hugging Face ecosystem.

In the 2.0 version, PyTorch includes a native scaled dot-product attention operator (SDPA) as part of `torch.nn.functional`. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the [official documentation](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) for more information, and [this blog post](https://pytorch.org/blog/out-of-the-box-acceleration/) for benchmarks.

We provide an integration with these optimizations out of the box in 🤗 Optimum, so that you can convert any supported 🤗 Transformers model so as to use the optimized paths & `scaled_dot_product_attention` function when relevant.

<Tip warning={true}>
The PyTorch-native `scaled_dot_product_attention` operator can only dispatch to Flash Attention if no `attention_mask` is provided.

Thus, by default in training mode, the BetterTransformer integration **drops the mask support and can only be used for training that do not require a padding mask for batched training**. This is the case for example for masked language modeling or causal language modeling. BetterTransformer is not suited for the fine-tuning of models on tasks that requires a padding mask.

In inference mode, the padding mask is kept for correctness and thus speedups should be expected only in the batch size = 1 case.
</Tip>

### Supported models

The list of supported model below:

- [AlBERT](https://arxiv.org/abs/1909.11942)
- [Bark](https://github.com/suno-ai/bark)
- [BART](https://arxiv.org/abs/1910.13461)
- [BERT](https://arxiv.org/abs/1810.04805)
- [BERT-generation](https://arxiv.org/abs/1907.12461)
- [BLIP-2](https://arxiv.org/abs/2301.12597)
- [BLOOM](https://arxiv.org/abs/2211.05100)
- [CamemBERT](https://arxiv.org/abs/1911.03894)
- [CLIP](https://arxiv.org/abs/2103.00020)
- [CodeGen](https://arxiv.org/abs/2203.13474)
- [Data2VecText](https://arxiv.org/abs/2202.03555)
- [DistilBert](https://arxiv.org/abs/1910.01108)
- [DeiT](https://arxiv.org/abs/2012.12877)
- [Electra](https://arxiv.org/abs/2003.10555)
- [Ernie](https://arxiv.org/abs/1904.09223)
- [FSMT](https://arxiv.org/abs/1907.06616)
- [GPT2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- [GPT-j](https://huggingface.co/EleutherAI/gpt-j-6B)
- [GPT-neo](https://github.com/EleutherAI/gpt-neo)
- [GPT-neo-x](https://arxiv.org/abs/2204.06745)
- [HuBERT](https://arxiv.org/pdf/2106.07447.pdf)
- [LayoutLM](https://arxiv.org/abs/1912.13318)
- [MarkupLM](https://arxiv.org/abs/2110.08518)
- [Marian](https://arxiv.org/abs/1804.00344)
- [MBart](https://arxiv.org/abs/2001.08210)
- [M2M100](https://arxiv.org/abs/2010.11125)
- [OPT](https://arxiv.org/abs/2205.01068)
- [ProphetNet](https://arxiv.org/abs/2001.04063)
- [RemBERT](https://arxiv.org/abs/2010.12821)
- [RoBERTa](https://arxiv.org/abs/1907.11692)
- [RoCBert](https://aclanthology.org/2022.acl-long.65.pdf)
- [RoFormer](https://arxiv.org/abs/2104.09864)
- [Splinter](https://arxiv.org/abs/2101.00438)
- [Tapas](https://arxiv.org/abs/2211.06550)
- [ViLT](https://arxiv.org/abs/2102.03334)
- [ViT](https://arxiv.org/abs/2010.11929)
- [ViT-MAE](https://arxiv.org/abs/2111.06377)
- [ViT-MSN](https://arxiv.org/abs/2204.07141)
- [Wav2Vec2](https://arxiv.org/abs/2006.11477)
- [Whisper](https://cdn.openai.com/papers/whisper.pdf)
- [XLMRoberta](https://arxiv.org/abs/1911.02116)
- [YOLOS](https://arxiv.org/abs/2106.00666)

Let us know by opening an issue in 🤗 Optimum if you want more models to be supported, or check out the [contribution guideline](https://huggingface.co/docs/optimum/bettertransformer/tutorials/contribute) if you want to add it by yourself!

### Quick usage

In order to use the `BetterTransformer` API just run the following commands:

```python
>>> from transformers import AutoModelForSequenceClassification
>>> from optimum.bettertransformer import BetterTransformer
>>> model_hf = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
>>> model = BetterTransformer.transform(model_hf, keep_original_model=True)
```
You can leave `keep_original_model=False` in case you want to overwrite the current model with its `BetterTransformer` version.

More details on `tutorials` section to deeply understand how to use it, or check the [Google colab demo](https://colab.research.google.com/drive/1Lv2RCG_AT6bZNdlL1oDDNNiwBBuirwI-?usp=sharing)!


<div class="mt-10">
  <div class="w-full flex flex-col space-y-4 md:space-y-0 md:grid md:grid-cols-2 md:gap-y-4 md:gap-x-5">
    <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="./tutorials/convert"
      ><div class="w-full text-center bg-gradient-to-br from-blue-400 to-blue-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Tutorials</div>
      <p class="text-gray-700">Learn the basics and become familiar with 🤗 and `BetterTransformer` integration. Start here if you are using 🤗 Optimum for the first time!</p>
    </a>
    <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="./tutorials/contribute"
      ><div class="w-full text-center bg-gradient-to-br from-indigo-400 to-indigo-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">How-to guides</div>
      <p class="text-gray-700">You want to add your own model for `BetterTransformer` support? Start here to check the contribution guideline!</p>
    </a>
  </div>
</div>
